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Abstract: Biological proteins are known to fold into specific 3D conformations. However, 
the fundamental question has remained: Do they fold because they are biological, and 
evolution has selected sequences which fold? Or is folding a common trait, widespread 
throughout sequence space? To address this question arbitrary, unevolved, random-sequence 
proteins were examined for structural features found in folded, biological proteins. 
Libraries of long (71 residue), random-sequence polypeptides, with ensemble amino acid 
composition near the mean for natural globular proteins, were expressed as cleavable 
fusions with ubiquitin. The structural properties of both the purified pools and individual 
isolates were then probed using circular dichroism, fluorescence emission, and fluorescence 
quenching techniques. Despite this necessarily sparse "sampling" of sequence space, 
structural properties that define globular biological proteins, namely collapsed conformations, 
secondary structure, and cooperative unfolding, were found to be prevalent among unevolved 
sequences. Thus, for polypeptides the size of small proteins, natural selection is not necessary 
to account for the compact and cooperative folded states observed in nature. 
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1. Introduction 

Fifty years ago, it was observed that proteins fold to complex, aperiodic structures. This observation 
created a conundrum: since a molecule of such high molecular mass was expected to be overwhelmed 
by conformational entropy, it was therefore expected that protein folds would resemble amorphous 
materials such as liquids or glasses, rather than the observed native states, more reminiscent of organic 
crystals [1]. Although there was at the time no physical theory accounting for how proteins reached 
stable and specific folds, Darwin's theory of natural selection did explain why proteins folded — so as 
to acquire specific biochemical functions. This fusion of structural biology and Darwinian theory 
implied that unevolved sequences {i.e., random-sequence proteins) would rarely have biological 
fianction and would likely be disordered [2,3]. 

Over time however, a number of observational [4-7], experimental [8] and theoretical [9-11] 
results have contradicted these early assumptions and it now appears that unique, compact protein 
folds may be much more common throughout sequence space than once presumed. For example, 
structural studies of natural and engineered mutant proteins have revealed globular, biological proteins 
to be extremely tolerant of amino acid substitutions [12], deletions [13], insertions including segmental 
substitutions [14] and circular permutations [15]. Studies comparing structurally homologous proteins 
from different species demonstrated that a mere 5-20% of a given protein's amino acid sequence 
remains invariant during evolution, and thus, point mutations appear to be tolerated at the majority of 
positions. Indeed, it has been demonstrated that sequences with little if any measurable sequence 
identity can nonetheless fold to identical 3D structures [16,17]. These observations imply, and 
subsequent theoretical studies have predicted, that any native fold is associated with a set of sequences, 
often connected by single amino acid mutations. These interconnected sets of sequences are referred to 
as neutral networks [18]. Distinct neutral networks are presumably interwoven in sequence space and 
might in some cases exist in close proximity to one another [19,20]. Rather than being statistical 
anomalies, native folds may be ubiquitous. Indeed, native folds associated with extensive neutral 
networks are said to have high "designability" [21]. 

These insights into the relationship between sequence, structure and natural selection are circumstantial 
however, confounded by the genealogical relationships inherent to biological data sets. Perhaps this 
mutational robustness is itself a derived, evolutionary adaptation? To acquire information about protein 
folding unbiased by evolution, it is necessary to perform structural characterizations on sequences 
that are explicitly outside biological and engineered data sets. Such synthetic, random-sequence 
polypeptides would have no sequence information relationship to evolved proteins nor to each other. 
Any physiochemical properties associated with these random-sequence polypeptides would therefore 
be independent of natural selection and historically contingent constraints of genealogical descent. 

As the raw material for evolution, the occurrence of native, or even partially folded conformations 
among random-sequence polypeptides would have profound implications for the diversification and 
emergence of folds in the course of evolution. Despite the very large size of protein sequence space 
(20^, where N is the length of the sequence), even a relatively small number of such random-sequence 
polypeptides could potentially test the early assumptions that unevolved proteins must be disordered. 

Along these lines, there have been a number of experimental studies that have synthesized and 
characterized random-sequence polypeptides. The earliest attempts at chemical synthesis of proteins 
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preceded molecular cloning techniques and solid-phase synthesis methods. At that time, amino acid 
polymers were produced by random co-polymerization of mixed amino acid N-carboxyanhydrides, 
containing side-chain blocking groups where necessary [22]. Solution synthesis methods lack control 
over product length, thus a wide range of polymer sizes was present in any particular sample, with a 
major fraction often far longer than biological proteins. Furthermore, the copolymers contained only a 
few types of amino acids, leading to unnaturally biased samples of sequence space. Litriguingly, 
despite these limitations, evidence of solubility and compactness was demonstrated [23,24]. 

More recently, genetically-encoded combinatorial libraries of arbitrary proteins have been 
developed, allowing more precise control over the length and composition of the synthetic arbitrary 
sequences. Davidson and Sauer [25,26] used synthetic DNA templates encoding 70-90 amino acid 
positions to express in E. coli, arbitrary sequences of glutamine (Q), arginine (R) and leucine (L). 
Although circular dichroism, gel filtration and analytical centrifLigation demonstrated evidence for 
secondary structure including cooperative unfolding, these QRL sequences, presumably because of 
their extreme degenerate amino acid composition, were unlike biological proteins in being remarkably 
resistant to chemical and thermal denaturation. QRL polypeptides were also relatively insoluble in the 
absence of denaturant. Doi et al. [27] used synthetic DNA templates encoding 115-144 amino acid 
positions, using 5 types of amino acids (valine, alanine, asparagine, glutamine, glycine: VANQG). 
Intriguingly, this limited "primordial" amino acid composition was found to have higher solubility 
than biological proteins composed of 20 amino acid types with comparable hydrophobicity. However, 
these VANQG proteins demonstrated less secondary structure than did the QRL sequences, 
presumably due to the presence of the helix breaker, glycine. Prijambada et al. [28] expressed 
random proteins having all 20 amino acids (95 random positions in a 141 synthetic construct) although 
they found that 20% of these sequences were soluble, they did not report detailed biophysical 
characterizations of individual clones. Most recently, Chiarabelli et al. [29] used phage display 
techniques and a tripeptide thrombin tag to isolate 79 random-sequence proteins having 20 amino acids 
types, with evidence that 20% were likely folding. However, the sequences analyzed in this study were 
only 50 residues in length, putting them at the extreme low-end for globular structures. 

The strategy introduced here combines a new method for encoding such polypeptides with an 
expression system capable of efficiently producing arbitrarily long, compositionally controlled 
random-sequence polypeptides [30] (see Figure 1 and Experimental Section). The random-sequence 
proteins were expressed in E. coli as carboxy-terminal fusions with ubiquitin via the pNMHUBpoly 
plasmid [31]. One of the remarkable properties of the expression system is that fused proteins are 
stabilized by ubiquitin during expression, yet they are able to maintain their own autonomous 
structure [32]. Expression of proteins as fiisions with ubiquitin in this system has been shown to result in 
high product yield (up to 60 mg fiision per liter culture) using a simple purification procedure [33,34]. 
The fusions can be processed using specific ubiquitin-carboxy extension hydrolases which faithfully 
cleave fusions at the C-terminus of ubiquitin [35]. The present approach represents a successful 
"shotgun" sampling of protein sequence space and lends itself well to the characterization of diverse 
pools of novel proteins and large numbers of individual sequences. We emphasize that our purpose 
here was not to probe the detailed structure of individual isolates, but to observe the generic folding 
properties of proteins sampled randomly from a defined compositional domain of sequence space 
where biological proteins are known to exist. This is an analysis of sequence space, not molecules. 
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As such, the exact amino acid sequences or the details of folded conformation of the isolates are not 
necessary in answering the questions we pose or to the conclusions reported herein. 

Figure 1. Cloning scheme for production of plasmid libraries encoding random-sequence 
fiisions with ubiquitin. (a) The plasmid pNMHUBpoly contains a synthetic gene which 
encodes human ubiquitin and is controlled by a X-phage PL promoter. The double-stranded 
fragments, thll52 and thll32, were produced by restriction digestion of DNA obtained 
from PCR on the synthetic oligonucleotides described in the text. These inserts were 
designed with a codon distribution encoding a biologic-like amino acid composition (see 
details in Section 3.1). Ligation of a Bgl II/Asp 718 digested thll32 fragment with the Bam 
HI/Asp 718 digested plasmid library results in loss of the original Bam HI site and 
introduction of a new site downstream, thus additional fragments can be inserted by 
redigestion of plasmid with Bam HI without internal cleavage of the novel genes; and 
(b) The insertion of a single thll52 fragment into the poly-cloning site following the 
ubiquitin gene results in 38 amino acid fiisions as shown beneath the DNA. Insertion of 
one or more thll32 fragments results in elimination of an in-frame termination codon, 
addition of a down-stream stop codon, and elongation of the open reading frame by 33 
codons per fragment. The libraries examined herein were LIB38 which contained a single 
insert and LIB71 which contained two inserts. 
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2. Results and Discussion 

2.1. Demonstration of Secondary Structure in Random-Sequence Fusion Protein Pools 

The purified, novel fusion proteins (see Section 3.2 for purification details) were examined 
spectroscopically for signatures of folding commonly found in natural globular proteins. First, the 
secondary structure content of the pool samples was examined by circular dichroism (CD) spectroscopy 
(Figure 2). The characteristic a-helix peaks near 195, 208, and 222 nm are clearly increased for the 
library samples over those of ubiquitin, indicating helical regions within some of the random-sequence 
extensions. The CD spectra were deconvoluted using LINCOMB and the five basis spectra provided 
with the program [36]. The estimates of helix content match the conclusions drawn from visual 
inspection of the spectra. After subtracting the contribution due to ubiquitin from the library spectra, it 
was estimated that the random-sequences of LIB38 contained 23.6% a-helix and 20.5% P-sheet, while 
LIB71 contained 30.5% a-helix and 10.9% P-sheet. Interestingly, these estimates are within the range 
of secondary structure contents found in natural proteins [37]. The secondary structure values for the 
flision libraries were calculated assuming a mean residue weight of 1 14 daltons and a mean length of 
33.3 and 56.2 amino acids for the extensions in LIB38 and LIB71, respectively. Mean lengths were 
calculated using the expected termination codon frequency. LINCOMB deconvolution of control CD 
spectra (ubiquitin alone and as a mixture with myoglobin) gave helix and sheet estimates as expected 
based on known 3D structures [38]. 

Figure 2, Circular dichroism (CD) spectra of ubiquitin and two libraries of random- 
sequence flisions. Spectra for ubiquitin (black line), UbLIB38 (gray line), and UbLIB71 
(dotted line) represent averages over five or ten scans. Data are given in mean residue 
elliptic ity. 
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To further demonstrate absolute differences between the secondary structure content of control 
ubiquitin and a fusion pool sample, CD spectra were monitored during the heat-induced loss of 
secondary structure (Figure 3). Melting experiments demonstrated the reduction of helical content in 
LIB71 and stability of the ubiquitin spectrum during the change from 25 °C to 75 °C. Ubiquitin has 
been shown to denature at approximately 85 °C [39]. The decrease in ellipticity observed for LIB71 
protein was essentially linear with increasing temperature, as might be expected for a complex mixture 
of proteins with distinct melting temperatures (data not shown). The loss of CD signal with increasing 
temperature clearly demonstrates, independent of possible artifacts from data normalization and 
deconvolution, the presence of heat-labile secondary structure in the random-sequence extensions. 

Figure 3, Effect of temperature on CD spectra of ubiquitin and fusion pool. CD spectra of 
(a) ubiquitin and (b) LIB71 are shown at 25 °C (solid line) and 75 °C (broken line). Data 
were collected as described in the Experimental Section, except using 10 repeat scans from 
195 to 225 nm, in 10 °C increments with 10 minute equilibration time. 
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2.2. Estimation of Folded Structure in Individual Random-Sequence Fusion Proteins 

Individual fusions were chosen from the libraries at random with the stipulation that they contained 
DNA inserts of proper size as shown by restriction mapping. CD spectra were obtained for fifteen 
individual fusion proteins at 10 °C. Of the secondary structure estimates from CD data, helix estimates 
have been shown most reliable from controls here and previously [40], therefore, these helix values are 
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used in further analyses. During processing of data on individual clones, normalization using polymer 
length was avoided since the lengths of individual sequences will deviate somewhat from the mean 
length of polymers in the library. Helix data from whole fusions are given in Table 1. 



Table 1. Helix content estimated by deconvolution of circular dichroism (CD) spectra. 



Pool 


Isolates 


Helix content 


Isolate averages 


LIB38 




18.6* 






38 mm 


14.7* 






38a 


16.4 






38c 


15.7 






38e 


16.4 






38j 


16.2 


16.6 ± 1.9 




38k 


13.8* 






38m 


19.3* 






38o 


17.6* 






38p 


19.6* 




LIB71 




22.4* 






71c 


19.5* 






71d 


18.2* 






71e 
71h 


14.4* 
17.0* 


17.0 ±1.8 




vij 


15.9 






71L 


17.1* 





Ubiquitin 16.1 ±0.04 

" Data are given for whole fusions in order to avoid ubiquitin subtraction, since the exact length of 
individual novel extensions may deviate from the ensemble mean length. Means and standard 
deviation. * denotes greater than two standard deviations from the measured helix content in the 
ubiquitin control. 



Next, individual protein fusions were probed for hydrophobic collapse by measuring their ability to 
protect intrinsic tryptophan side chains from solvent water. The maximum wavelength of tryptophan 
emission is determined by the polarity of its environment. With excitation at 280 nm, tryptophan 
fluoresces at or below -345 nm in a non-polar organic solvent or the interior of a protein, while in 
water the emission maximum is red shifted to -355 nm or above [41]. Thus, fluorescence emission 
(FE) measurement of the wavelength of maximum tryptophan fluorescence can be used as an indicator 
of structural collapse or compactness. Addition of sufficient denaturant to a native fold causes protein 
unfolding, exposure of protected tryptophan side chains, and a corresponding red-shift of the emission 
maximum. Ubiquitin contains no tryptophan and so its FE spectrum in the region of 345-355 nm is 
minimal and unaffected by denaturant. Any change observed in the FE spectra of the ubiquitin flisions 
upon addition of denaturant can therefore be attributed to alteration in the conformation of the random- 
sequence domain. 

The two protein pools as well as 25 tryptophan-containing individual fusions were examined using 
FE spectroscopy under native and denaturing conditions. The spectra for LIB38 and LIB71 both show 
a red shift and decrease in intensity in the presence of 6 M GuHCI, consistent with protection of 
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tryptophan from water in the absence of denaturant and exposure in its presence (Figure 4). This 
observation provides evidence of collapsed states in a large population of proteins in each fusion 
library. The change in emission intensity is also an indicator of structural alterations near tryptophan 
residues; however the direction and magnitude of such changes are unpredictable [41]. The reason for 
a larger red-shift and smaller intensity change in the LIB71 spectra versus that for LIB38 remains 
unclear. A large majority (19 out of 25) of the individual fusions examined also showed tryptophan 
protection. To test refolding of the random-sequence, reprotection of tryptophan was examined after 
performing buffer exchange from 6 M GuHCl back into Tris buffer. Reprotection was demonstrated 
for all six fusions tested (data not shown), thus collapsed structure is a property of the novel 
extensions, independent of in vivo protein biosynthesis. 

Figure 4. Fluorescence emission (FE) spectra of fusion protein pools (a) LIB38; and (b) 
LIB71. The spectra were taken in water (solid lines) and in 6 M GuHCl (broken lines). 
Decreased intensity and red shift of the emission wavelength maximum in GuHCl indicate 
protein unfolding and tryptophan exposure. The fluorescence emission of ubiquitin in this 
range was low and unaffected by GuHCl. 
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The minimum structural requirements necessary for the protection of tryptophan from solvent have 
not been extensively investigated. However, using overlapping fragments of nicotinic acetylcholine 
receptor, it was observed that a 12-mer was incapable of protecting an intrinsic tryptophan but an 
18-mer and a 32-mer were shown to collapse [42]. The present results suggest hydrophobic collapse is 
a rather common property of arbitrary, random-sequence polypeptides with amino acid compositions 
similar to biological proteins. 
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2.3. Cooperative Unfolding Shown by Intrinsic Fluorescence Emission 

Having found evidence for helical structure and hydrophobic collapse, we next probed the 
cooperativity of the unfolding transitions in the random-sequence domains. The wavelength of maximum 
tryptophan fluorescence was monitored with increasing denaturant concentration to determine if loss 
of collapsed structure occurred gradually (non-cooperative) or rapidly (cooperative). Twelve flisions 
were denatured by incremental addition of GuHCl followed by equilibration and measurement of FE 
spectra. Two data sets and their unfolding profiles are shown (Figure 5). Three of the twelve fusions 
gave unfolding profiles similar to that for Ub71h (Figure 5b-c) in which the red shift of the FE peak 
occurred almost uniformly across the entire range of denaturant concentrations. This type of profile is 
indicative of non-cooperative unfolding. The majority of fiisions (nine out of twelve) yielded unfolding 
profiles resembling that for UbVlL (Figure 5a-c) in which the red shift occurred non-uniformly. In 
these cases, a rapid increase in FE peak wavelength over a fairly small GuHCl concentration range 
indicated cooperative transitions. In all nine cases, the cooperative unfolding transitions occurred at 
GuHCl concentrations between 1 and 1.5 M which is within the low end of the range observed for 
unfolding of biological proteins [44]. This result does not distinguish between native and partially-folded 
structures, since cooperative unfolding has been observed previously in partially-folded designed 
proteins [43-45] and in molten globule states of natural proteins [46]. 

Figure 5, FE unfolding curves for two fusions with increasing GuHCl. For Ub71L (a) and 
UbVlh (b). GuHCl concentrations: solid circles 0.0 M, open circles 0.2 M, solid squares 
0.5 M, open squares 1 M, sohd triangles 1.6 M, open triangles 2.6 M, solid diamonds 3.6 
M, open diamonds 4.5 M. Spectra were also taken in 6 M GuHCl (not shown), (c) Fraction 
unfolded was calculated by dividing total observed red shift by the red shift at each 
denaturant concentration. Data are given for Ub71L attached to ubiquitin (solid circles) and 
cleaved from ubiquitin (open circles) and for Ub71h attached to ubiquitin (sohd squares). 
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Figure 5. Cont. 

0.95 
0.85 
0.75 
0.65 
0.55 
0.45 

315 330 345 360 375 

Wavelength (nm) 

(b) 

1 

0.8 
0.6 
0.4 
0.2 
0 

0 1 2 3 4 5 6 

GuHCI Concentration (M) 

(c) 

2.4. Ubiquitin Independent Unfolding and Refolding 

One of the natural functions of C-terminal ubiquitin fusion is to minimize the degradation of poorly 
folding proteins during biosynthesis. Here, the influence of ubiquitin fiision on the folding of random 
sequence proteins was tested by cleaving the extensions from ubiquitin and examining their FE spectra 
with and without denaturant. UbCE hydrolase LI was used to cleave ubiquitin fusions specifically 
between the C-terminal amino acid of ubiquitin and the N-terminal residue of the extensions. Eight 
proteins, which demonstrated tryptophan protection as ubiquitin fusions, were completely cleaved (as 
shown by SDS-PAGE), denatured with GuHCl, and refolded by buffer exchange. All eight of the 
clones displayed some degree of tryptophan protection in buffer and loss of protection in 6 M GuHCl. 
The unfolding of cleaved protein 71L was monitored with increasing GuHCl concentration. Its unfolding 
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profile resembles that of the fusion protein, but appears shghtly less cooperative (Figure 5(c)). These 
observations demonstrate that the folding of this arbitrary-sequence polypeptide is largely independent 
of attachment to ubiquitin. 

2.5. Molten Globule-Like and Native-Like Behavior Suggested by ANS Binding Data 

The fluorescent dye l-anilino-naphthalene-8-sulfonate (ANS) has been shown previously to bind 
with higher affinity to the molten globule forms of natural proteins than to the native or unfolded 
states [47]. Presumably, this binding is due to the loose, and therefore accessible, hydrophobic core 
characteristic of the molten globule state. The intensity of emission by ANS is increased when the dye 
is sequestered from water within the interior of a protein, therefore binding is evident in FE spectra. 
Three fusion proteins were tested for association with ANS in buffer and in 2 M GuHCl (Figure 6). 
Ub38mm was chosen as a negative control, since it shows no evidence of a collapsed conformation, 
failing to protect an intrinsic tryptophan. UbVlh and Ub71L were examined because, while both protect 
tryptophan residues, the former unfolds non-cooperatively and the latter with marked cooperativity. 
While ubiquitin is stable in 2 M GuHCl [48], the two proteins fi-om LIB71 are predominately unfolded 
at that concentration. 

Figure 6, FE spectra of ANS with proteins: ubiquitin (solid circles), Ub38 mm (open 
circles), Ub71L (open squares), and Ub71h (solid triangles). Samples contained 10 )J,M 
protein and 250 \iM ANS in 12.5 mM KH2P04, 50 mM NaF, pH 7.5 containing either no 
GuHCl (a); or 2 M GuHCl (b). Excitation wavelength was 340 nm. 
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As expected, the intensity of ANS emission with Ub38mm was unaffected by denaturant indicating 
the lack of a collapsed conformation with an accessible hydrophobic core for the protein in buffer. 
Also, as expected, the spectrum of ANS with ubiquitin remains unchanged after addition of GuHCl, 
since ubiquitin is very stable and excludes ANS under both conditions. The intriguing result lies in the 
comparison of the spectra from the two LIB71 proteins. UbVlh caused a significant increase in the 
intensity of ANS FE in buffer (Figure 6(a)), but not in the presence of denaturant (Figure 6(b)). This 
indicates the existence of a denaturant-sensitive, collapsed conformation containing a hydrophobic 
core which is accessible to large organic molecules, analogous to the molten globule states of natural 
proteins. On the other hand, Ub71L caused no increase in the intensity of ANS fluorescence in buffer. 
UbVlL was shown to be larger than UbVlh (by SDS-PAGE) and to exist in a collapsed state (by 
intrinsic tryptophan FE), therefore it appears the packing in its core may be sufficiently tight to exclude 
ANS. By this measure, the behavior of Ub71L resembles that of a native protein more closely than that 
of a molten globule. 

2.6. The Statistical Validity of a Random Sample of Molecular Sequences in vivo and in vitro 

Since we are utilizing these protein libraries as random samples of unevolved proteins, we must 
consider spurious selective pressxares applied during expression and purification. Since completely 
unfolded chains would be susceptible to degradation by host proteases, in vivo expression of polypeptides 
may automatically select for sequences with compact structure. In practice however, in vivo proteolysis 
was contraindicated by the fact that all thirty, randomly chosen, individual flisions examined were of 
greater apparent molecular weight than ubiquitin as shown by SDS-PAGE. 

If proteolysis had been prevalent, one would have expected to see clones of unflised ubiquitin, since 
there is no reason to believe the encoded, stable ubiquitin would be degraded along with the 
susceptible extension. In addition, flision with ubiquitin has been shown repeatedly to decrease 
proteolytic degradation of recombinant proteins [49] including a series of 10-21 residue peptides 
which are likely non-compact [35]. It is still possible that certain clones failed to grow at all. However, 
tight control of expression and brief induction followed by rapid purification limited the exposure of 
product to host cell contents and mitigated possible toxic effects on the cells. 

Although selection is implicit in the process of protein purification, it is clear from the data 
presented here that polypeptides displaying ordered, folded conformations are easily obtained from 
pools that have not been stringently selected for specific functions. While there exist on the order of 
lO'*^ and 10^^ possible sequences consistent with libraries LIB38 and LIB71, respectively, we have 
examined only a small number (approximately 30 individual proteins). Such profoundly sparse 
sampling would render a lack of evidence for structure uninterpretable, since failure to observe some 
property would not disprove its existence. On the other hand, positive evidence of structure, such as 
our observations of collapsed conformations, cooperative unfolding, and secondary structure, conclusively 
demonstrate that these properties are surprisingly common in sequence space. 

2. 7. Rudimentary Folds 

Within the compositional regime of protein sequence space examined here, it is evident that 
sequences having collapsed conformations with cooperative folding transitions and measurable levels 
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of secondary structure are common and easily discovered by a random search. These structural 
properties are not unlike those observed among induced molten globule states of perturbed native 
folds [50]. To differentiate the two, we refer to the observed, partially ordered conformational states of 
unevolved sequences as rudimentary folds. Since rudimentary folding seems to resemble a molten 
globule state more closely than an unfolded state, native folds are, most likely, also more common than 
previously suspected. Indeed, in our sparse sampling, we have discovered at least one example protein 
(71L) exhibiting native-like folding characteristics. 

These results are similar to and consistent with those of computational and laboratory experiments 
comparing the folded structures of random-sequence single-stranded RNAs [51-53]. For example, 
in silico folding of randomized RNA sequences demonstrates that unevolved RNAs have the same 
structure-dependent compositional biases as those observed for rRNAs [54-56]. Li laboratory experiments 
using a battery of complementary structural probes (in this case, native gel electrophoresis, 
analytical centrifiigation and lead-II hydrolysis), it was found that unevolved RNAs (having sequence 
length and nucleotide composition analogous to evolved sequences) adopted sequence-specific, 
magnesium-dependent folding to states as compact as those documented for cognate biological 
RNAs [57]. Yet, the authors conclude that the secondary structural elements of these unevolved RNAs 
failed to attain the unique tertiary contacts that characterize native RNA folds. This observation that 
typical random-sequence RNA adopts a relatively small number of compact states is directly analogous 
to the molten globule-like states described herein for random-sequence proteins. Hence, rudimentary 
folding appears to be a common feature of both RNA and protein polymeric systems. 

3. Experimental Section 

3.1. Library Construction 

A detailed description of the library design and construction is published elsewhere [30,34]. Briefly, 
the cloning procedure (summarized in Figure 1) utilized synthetic oligonucleotides each containing a 
variable region flanked on each end by constant primer-binding sequences. The randomized codons 
were constructed by adding premixed nucleoside phosphoramidites during the elongation step of 
standard oligonucleotide synthesis. Thus, the exact sequence of any individual molecule was 
determined partially by chance, while the ensemble average was dictated by the input nucleotide ratios. 
The nucleotide mixtures were designed to bias the probabilities away from termination codons and 
toward an amino acid composition similar to the mean for natural, globular proteins [30]. In the 
present libraries, the input molar ratios for T:C:A:G (in percentage) were 8:21:32:39 in the first codon 
position, 24:25:28:23 in the second position, and 30:30:0:40 in the third. The reactivities of the four 
nucleoside phosphoramidites were shown previously to be essentially equal in cases where fresh, 
anhydrous solutions were used [58-60]. 

The constant regions of the synthetic DNA contained, on the 3 ' end, a Bam HI restriction site and 
on the 5' end, a Bgl II site. These enzymes generate compatible overhangs and were used to ensure 
proper orientation of the fragments in the plasmid while allowing for stepwise addition of an unlimited 
number of inserts. The actual length of any individual fusion protein is determined not only by the 
number of inserted fragments but also by chance, since the probability of encoding a termination 
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codon is less than 0.007 per codon in these libraries. Protein samples were produced and purified for 
two fiision libraries, LIB38 (30,000 clones) and LIB71 (19,000 clones), having one or two inserts, 
respectively and for unmodified ubiquitin (produced from pNMHUB, identical to pNMHUBpoly 
except the poly-linker following the ubiquitin gene is replaced by an in-frame termination codon). 
Expression and purification of the fusions proteins is outlined in the methods section below. Proteins 
were shown to be homogeneous by Coomassie blue stained, overloaded SDS-PAGE. Experimental 
analysis of the overall amino acid compositions of the fusion protein pools was described previously 
and indicated a rich, well-balanced amino acid composition coincident with the biological region of 
sequence we targeted for investigation [34]. 

3.2. Protein Purification 

Plasmid libraries were transformed into hexamminecobalt chloride treated E. coli AR68 [61], 
a protease deficient strain which constitutively expresses a heat labile >.-phage cl repressor protein [62]. 
Transformants were grown on LB plates containing 75 )ag/mL ampicillin, counted to establish 
library diversity, and suspended in LB medium. Individuals and pools of clones were grown at 30 °C 
then heat shocked at 42 °C to induce production of the fusions. Cells were lysed by sonication 
and the supernatant from a high-speed centrifugation was passed over a Q-Sepharose FPLC column 
(Pharmacia LKB) in 20 mM Tris/HCl, pH 7.5, 50 mM NaCl, 0.03% sodium azide. Ubiquitin and the 
ubiquitin fusions passed through the column without binding; peak fractions were collected, 
concentrated by ultrafiltration, and passed over a Sephadex G-50 FPLC column. This gel permeation 
step, in most cases, yielded a peak of fiision protein shown to be pure by polyacrylamide electrophoresis. 
If necessary, peak fractions were concentrated and the G-50 step repeated. 

3.3. Fusion Protein Hydrolysis 

The fiision Ub71L was digested with UbCE hydrolase LI [63] at 37 °C in 20 mM DTT, 20 mM 
TrisHCl, pH 7.5, 50 mM NaCl for 40 hours, then denatured in 5 M GuHCl, and refolded into 
12.5 mM KH2PO4, 50 mM NaF, pH 7.5 by buffer exchange using a Centra-Con 3 ultrafiltration unit 
(Amicon). Cleavage of the novel carboxy extension protein from ubiquitin was shown to be complete 
by SDS-PAGE. 

3.4. Spectroscopy 

CD spectra were taken on an Aviv 62DS spectropolarimeter. Five or ten repeat scans were collected 
per sample with 0.5 nm step size and 2 second averaging time. Proteins (10-25 )J,M) were in 12.5 mM 
KH2PO4, 50 mM NaF, pH 7.5, at 25 °C in either a 1 or 2 mm path cuvette. The normalization and 
deconvolution of CD spectra is critically dependent upon the protein concentration values used in 
normalizing the curves, therefore, samples were analyzed in triplicate using the micro-BCA assay 
(Pierce, Rockford, IL). Protein concentrations measured by BCA assay were verified by comparison 
with quantitative amino acid analysis results for several samples. 

FE spectra were obtained using an SLM Spectrofiuorimeter. A rectangular micro-cuvette was used 
with 130 )j.L protein sample (5-10 \iM) in water and in 6 M GuHCl. Data were collected from 300 to 
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420 nm with 1 nm step size, 280 nm excitation wavelength. In the FE monitored unfolding 
experiments aliquots of 8 M GuHCI were added manually to the cuvette, mixed by pipette trituration, 
and equilibrated for 1-2 minutes before collecting the spectra. Spectra were normalized by baseline 
subtraction and dilution correction. Samples in the ANS binding experiments contained 10 \iM protein 
and 250 [xM ANS in 12.5 mM KH2PO4, 50 mM NaF, pH 7.5 containing either no GuHCl or 2 M 
GuHCl. Data were collected over the range 420 to 580 nm with 340 nm excitation. 

4, Conclusions 

In this study, we have created combinatorial protein pools with amino acid compositions near those 
observed in globular proteins found in nature [30]. An intriguing future direction would be to 
synthesize combinatorial pools having amino acid compositions similar to other classes of protein 
structures (such as fibrous [64], membrane [65], and the more recently discovered intrinsically 
unstructured [66] proteins). Might random protein sequences having these distinct amino acid 
compositions have correspondingly distinct folding properties? Ultimately, it will be necessary to 
explore the vast compositional regimes of sequence space that are not occupied by biological 
sequences. Would these regions display lower (or possibly higher) frequencies of native folding? Of 
rudimentary folding? 

It has been assumed that biological proteins fold into specific three-dimensional conformations 
because evolution has acted over billions of years to select those rare sequences having this capability. 
The evidence presented herein indicates that synthetic proteins having unevolved sequences possess 
many features of folding usually considered to be derived evolutionary adaptations. Although selection 
is essential for the evolution of specific, functional protein folds, it is not required to account for 
secondary structure and compact, cooperative folding. 
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